Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition
In this paper we address the problem of human action recognition from video
sequences. Inspired by the exemplary results obtained via automatic feature
learning and deep learning approaches in computer vision, we focus our
attention on learning salient spatial features via a convolutional neural
network (CNN) and then map their temporal relationship with the aid of
Long Short-Term Memory (LSTM) networks. Our contribution in this paper is a
deep fusion framework that more effectively exploits spatial features from CNNs
with temporal features from LSTM models. We also extensively evaluate their
strengths and weaknesses. We find that by combining both sets of features,
the fully connected features effectively act as an attention mechanism to
direct the LSTM to interesting parts of the convolutional feature sequence. The
significance of our fusion method lies in its simplicity and effectiveness compared to other state-of-the-art methods. The evaluation results demonstrate that this hierarchical multi-stream fusion method outperforms single-stream mapping methods, achieving high accuracy that surpasses current state-of-the-art methods on three widely used databases: UCF11, UCFSports, and jHMDB.
Comment: Published as a conference paper at WACV 201
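The fusion described above, where fully connected features act as an attention signal over the convolutional feature sequence, can be pictured with plain dot-product attention. This is a minimal NumPy sketch, not the published architecture; the scoring function, dimensions, and random features are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fc_guided_attention(conv_seq, fc_feat):
    # Score each timestep of the conv feature sequence against the
    # fully connected feature vector, then reweight the sequence so a
    # downstream LSTM is directed towards the most relevant frames.
    scores = conv_seq @ fc_feat            # (T,) dot-product relevance
    weights = softmax(scores)              # attention distribution over time
    return weights, weights[:, None] * conv_seq

rng = np.random.default_rng(0)
T, D = 8, 16                               # illustrative sequence length / dim
conv_seq = rng.normal(size=(T, D))         # stand-in for CNN conv features
fc_feat = rng.normal(size=D)               # stand-in for CNN fc features
weights, reweighted = fc_guided_attention(conv_seq, fc_feat)
```

In the actual model the reweighted sequence would feed the LSTM stream rather than being returned directly.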
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video. Several attempts have been made to capture frame-level salient aspects through attention, but they lack the capacity to effectively map the temporal relationships between frames, as they capture only a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales, forming embeddings at the frame and segment levels, and performs fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets, achieving state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings, including static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.
Comment: Published in Pattern Recognition Letters
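One minimal way to picture the two temporal scales above is mean-pooling frame features into coarser segment embeddings. This is a toy NumPy sketch only; the actual framework uses recurrent attention encoders at each level, and the segment length and dimensions here are arbitrary assumptions:

```python
import numpy as np

def hierarchical_embeddings(frame_feats, seg_len):
    # Pool frame-level features into segment-level embeddings, giving
    # the fine and coarse temporal views the framework reasons over.
    T, D = frame_feats.shape
    n_seg = T // seg_len                   # drop any trailing partial segment
    segments = frame_feats[: n_seg * seg_len].reshape(n_seg, seg_len, D)
    return frame_feats, segments.mean(axis=1)

rng = np.random.default_rng(1)
frame_emb, seg_emb = hierarchical_embeddings(rng.normal(size=(12, 4)), seg_len=3)
```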
Towards On-Board Panoptic Segmentation of Multispectral Satellite Images
With tremendous advancements in low-power embedded computing devices and
remote sensing instruments, the traditional satellite image processing pipeline
which includes an expensive data transfer step prior to processing data on the
ground is being replaced by on-board processing of captured data. This paradigm
shift enables critical and time-sensitive analytic intelligence to be acquired
in a timely manner on-board the satellite itself. However, at present, the
on-board processing of multi-spectral satellite images is limited to
classification and segmentation tasks. Extending this processing to its next
logical level, in this paper we propose a lightweight pipeline for on-board
panoptic segmentation of multi-spectral satellite images. Panoptic segmentation
offers major economic and environmental insights, ranging from yield estimation
from agricultural lands to intelligence for complex military applications.
Nevertheless, the on-board intelligence extraction raises several challenges
due to the loss of temporal observations and the need to generate predictions
from a single image sample. To address these challenges, we propose a multimodal teacher network based on a cross-modality attention-based fusion strategy that improves segmentation accuracy by exploiting data from multiple modalities. We
also propose an online knowledge distillation framework to transfer the
knowledge learned by this multimodal teacher network to a uni-modal student, which receives only a single-frame input and is more appropriate for an
on-board environment. We benchmark our approach against existing
state-of-the-art panoptic segmentation models using the PASTIS multi-spectral
panoptic segmentation dataset considering an on-board processing setting. Our
evaluations demonstrate a substantial increase in accuracy metrics compared to
the existing state-of-the-art models.
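The distillation step can be illustrated with the standard response-based objective, a temperature-softened KL divergence between teacher and student logits. This is a generic sketch, not the exact loss used in the paper; the temperature value and logit shapes are assumptions:

```python
import numpy as np

def softened(logits, T):
    # Temperature-softened softmax along the class dimension.
    e = np.exp((logits - logits.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # as is conventional in knowledge distillation.
    p_t = softened(teacher_logits, T)
    p_s = softened(student_logits, T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(2)
t = rng.normal(size=(4, 5))                # multimodal teacher logits (assumed shape)
s = rng.normal(size=(4, 5))                # single-frame student logits
loss_mismatch = kd_loss(s, t)              # positive when distributions differ
loss_match = kd_loss(t, t)                 # identical logits give zero loss
```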
Learning Through Guidance: Knowledge Distillation for Endoscopic Image Classification
Endoscopy plays a major role in identifying any underlying abnormalities
within the gastrointestinal (GI) tract. There are multiple GI tract diseases
that are life-threatening, such as precancerous lesions and other intestinal
cancers. In the usual process, a diagnosis is made by a medical expert, which can be prone to human error, and the accuracy of the test is entirely dependent on the expert's level of experience. Deep learning, specifically Convolutional Neural Networks (CNNs), which are designed to perform automatic feature learning without prior feature engineering, has recently shown great benefits for GI endoscopy image analysis. Previous research has developed
models that focus only on improving performance, as such, the majority of
introduced models contain complex deep network architectures with a large
number of parameters that require longer training times. However, there is a
lack of focus on developing lightweight models which can run in low-resource
environments, which are typically encountered in medical clinics. We investigate three knowledge distillation (KD) based learning frameworks: response-based, feature-based, and relation-based mechanisms, and introduce a novel multi-head attention-based feature fusion mechanism to support relation-based learning. Compared to the
existing relation-based methods that follow simplistic aggregation techniques
of multi-teacher response/feature-based knowledge, we adopt the multi-head
attention technique to provide flexibility towards localising and transferring
important details from each teacher to better guide the student. We perform
extensive evaluations on two widely used public datasets, KVASIR-V2 and
Hyper-KVASIR, and our experimental results signify the merits of our proposed
relation-based framework in achieving an improved lightweight model (only 51.8k
trainable parameters) that can run in a resource-limited environment.
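A minimal sketch of attention-based fusion across several teachers: each head computes its own softmax weighting over the teachers for a slice of the feature dimensions. This is illustrative only; the head count, scaling, and dot-product scoring are assumptions, not the published layer:

```python
import numpy as np

def multi_head_teacher_fusion(teacher_feats, student_query, n_heads=2):
    # Each head attends over the K teachers independently on its own
    # slice of the feature dimensions, so different teachers can
    # dominate different parts of the fused representation.
    K, D = teacher_feats.shape             # K teachers, D-dim features
    d = D // n_heads
    fused = np.empty(D)
    for h in range(n_heads):
        sl = slice(h * d, (h + 1) * d)
        scores = teacher_feats[:, sl] @ student_query[sl] / np.sqrt(d)
        e = np.exp(scores - scores.max())
        w = e / e.sum()                    # per-head weights over teachers
        fused[sl] = w @ teacher_feats[:, sl]
    return fused

rng = np.random.default_rng(3)
teachers = rng.normal(size=(3, 8))         # three hypothetical teacher networks
fused = multi_head_teacher_fusion(teachers, rng.normal(size=8))
```

Because each head takes a convex combination of teacher features, the fused vector always stays within the range spanned by the teachers.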
Deep learning for human action understanding
This thesis addresses the problem of understanding human behaviour in videos across multiple problem settings, including recognition, segmentation, and prediction. Considering the complex nature of human behaviour, we propose to capture both short-term and long-term context in the given videos, and introduce novel multitask learning-based approaches to solve the action prediction task, as well as an adversarially trained approach to action recognition. We demonstrate the efficacy of these techniques by applying them to multiple real-world human behaviour understanding settings, including security surveillance, sports action recognition, group activity recognition, and recognition of cooking activities.
Fuzzy logic based mobile robot target tracking in dynamic hostile environment
With the increasing number of applications, mobile robots are required to work under challenging conditions where the environment is cluttered with moving obstacles and hostile regions. In this paper we propose a fuzzy logic based control system for mobile robot target tracking and obstacle avoidance in a dynamic hostile environment. Given the existing body of research in the field of obstacle avoidance and path planning, which is reviewed in this context, particular attention is paid to integrating computer vision based sensing mechanisms with a robust fuzzy logic based navigation control method. Depth and colour information for both navigation and target tracking are captured using an Asus Xtion PRO sensor, which provides RGB colour and 3D depth imaging data. The fuzzy logic based navigation control algorithm is implemented to handle obstacle avoidance, hostile region avoidance, and target tracking. The effectiveness of the proposed approach was verified through several experiments, which demonstrate the feasibility of the fuzzy target tracker as well as the extensible obstacle and hostile region avoidance system.
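A fuzzy controller of this kind can be illustrated with two shoulder membership functions and two rules combined by weighted-average defuzzification. This is a toy sketch; the membership breakpoints, the rule set, and the fixed avoidance angle are invented for illustration and are not the paper's actual rule base:

```python
def falling(x, a, b):
    # "Near"-style membership: 1 below a, 0 above b, linear in between.
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def rising(x, a, b):
    # "Far"-style membership: 0 below a, 1 above b, linear in between.
    return 1.0 - falling(x, a, b)

def steering(obstacle_dist, target_bearing):
    # Rule 1: IF obstacle is NEAR THEN turn hard away (fixed 90-degree
    #         avoidance command, an illustrative assumption).
    # Rule 2: IF obstacle is FAR  THEN turn towards the target bearing.
    near = falling(obstacle_dist, 0.5, 2.0)   # breakpoints in metres, assumed
    far = rising(obstacle_dist, 0.5, 2.0)
    avoid_cmd = 90.0
    return (near * avoid_cmd + far * target_bearing) / (near + far)
```

With a clear path the controller simply tracks the target; as the obstacle closes in, the output blends smoothly towards the avoidance command.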
Fine-grained action segmentation using the semi-supervised action GAN
In this paper we address the problem of continuous fine-grained action segmentation, in which multiple actions are present in an unsegmented video stream. The challenge for this task lies in the need to represent the hierarchical nature of the actions and to detect the transitions between actions, allowing us to localise the actions within the video effectively. We propose a novel recurrent semi-supervised Generative Adversarial Network (GAN) model for continuous fine-grained human action segmentation. Temporal context information is captured via a novel Gated Context Extractor (GCE) module, composed of gated attention units, which directs the queued context information through the generator model for enhanced action segmentation. The GAN is made to learn features in a semi-supervised manner, enabling the model to perform action classification jointly with the standard, unsupervised GAN learning procedure. We perform extensive evaluations on different architectural variants to demonstrate the importance of the proposed network architecture, and show that it is capable of outperforming the current state-of-the-art on three challenging datasets: 50 Salads, MERL Shopping, and the Georgia Tech Egocentric Activities dataset.
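The gated attention idea behind the GCE module can be sketched as a sigmoid gate, conditioned on the current frame feature, that scales each queued context vector before it reaches the generator. This is a toy NumPy sketch; the projection `W_g`, the gating form, and the mean pooling are illustrative assumptions rather than the published unit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_context(current, context_queue, W_g):
    # One scalar gate per queued context vector, conditioned on the
    # current frame feature through an (assumed) learned projection W_g.
    gates = sigmoid(context_queue @ W_g @ current)   # shape (queue_len,)
    # Pass on the gated context, pooled into a single summary vector.
    return (gates[:, None] * context_queue).mean(axis=0)

rng = np.random.default_rng(4)
D = 6                                      # illustrative feature dimension
ctx = gated_context(rng.normal(size=D),    # current frame feature
                    rng.normal(size=(5, D)),  # queue of 5 context vectors
                    rng.normal(size=(D, D)))  # hypothetical projection W_g
```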